Restoring normal master-slave roles after replication failure

Alex Balashov <abalashov@xxxxxxxxxxxxxxx> · Wed, 01 Apr 2015 13:22:37 -0400

Hello all,

I'm terribly sorry if this question has been asked before in a number of 
different forms; I wasn't able to find quite the nuance I was looking 
for in the archives.

We're running PostgreSQL 9.4 with streaming replication + hot_standby + 
 the newfangled physical replication slots. On the master, the 
wal_archive level is 'hot_standby', and on the slave, it's 'archive'[1].

As I have understood the conventional wisdom from scouring the 
documentation and list archives, if the master fails and we 
trigger-promote the slave to master and then, after fixing the master 
server, we need to restore it to the master role and restore the slave 
server back down to the hot_standby role, we need to reinitialise 
replication. That is, we need to do a brand new base backup of the slave 
back to the master.

My question is:

Isn't there a better way that lends itself to some degree of automation? 
We have a ~500 GB database and while we can do a base backup, it's 
neither pretty nor quick. Can we not configure $OLD_SLAVE as a master 
(with WAL senders, replication role and all), put $OLD_MASTER in 
recovery mode and have it play out the accumulated WALs in reverse from 
$NEW_MASTER, then shut down $NEW_MASTER, reconfigure both replicas to 
their appropriate roles, and bring them back up like before?

If this is possible, does it depend on setting wal_archive to 
'hot_standby' level on the slave during the course of normal operation, 
too? Shouldn't 'archive' be sufficient to restore from? What exactly 
does the 'hot_standby' WAL level contain that 'archive' doesn't?

If I'm wrong, is there no other way to do this except to re-do a whole 
base backup back to the master? How do people with very large databases 
deal with this? Just grin and bear it?

Many thanks in advance, and I apologise again if this has previously 
been answered in some canonical way.

-- Alex

[1] This is as it should be, right? We want to have backup WALs in case 
we experience a total physical meltdown of the master and need to do 
PITR, and/or to archive them to WAL-E from the slave if the slave were 
promoted to master.

--
Alex Balashov | Principal | Evariste Systems LLC
303 Perimeter Center North, Suite 300
Atlanta, GA 30346
United States

Tel: +1-800-250-5920 (toll-free) / +1-678-954-0671 (direct)
Web: http://www.evaristesys.com/, http://www.csrpswitch.com/

--
Sent via pgsql-admin mailing list (pgsql-admin@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin