Re: Broken replication in 3 node setup, master halt, standby 1 promoted, standby 2 - replication not working

Rui DeSousa <rui.desousa@xxxxxxxxxx> · Wed, 10 Jan 2018 17:39:01 -0500

It appears that your missing is following in the recovery.conf file.  It instructs the replica to follow a new timeline when presented; without it will refuse to follow the promoted node.

recovery_target_timeline=latest

Are you using sync replication? 

> On Jan 10, 2018, at 2:06 PM, Pritam Barhate <pritambarhate@xxxxxxxxx> wrote:
> 
> Hi,
> 
> I am trying to test the failover scenario on a brand new 3 node setup. 
> 
> pg1 is master
> pg2 and pg3 are standby hosts in streaming replication, both following pg1. 
> 
> When I halt pg1, promote pg2 and update recovery.conf on pg3 to point to pg2, the replication is broken. Can you please help me identify the correct procedure to achieve this failover? 
> 
> Here are the details:
> 
> Software Versions:
> 
> PostgreSQL: 9.6
> Ubuntu 16.04
> PgBackRest 1.25
> 
> Here is the scenario:
> 
> Set up 3 nodes in streaming replication:
> 
> master (pg1), standby 1 (pg2), standby 2 (pg3)
> 
> Also, I have set up pgbackrest, to back up from standby. It's working correctly.
> 
> Run the following on pg1:
> 
> sudo -u postgres psql -x -c "select * from pg_stat_replication;"
> 
> -[ RECORD 1 ]----+------------------------------
> pid              | 20714
> usesysid         | 16384
> usename          | replicator
> application_name | walreceiver
> client_addr      | 10.0.15.11
> client_hostname  | 
> client_port      | 48640
> backend_start    | 2018-01-10 18:17:46.105694+00
> backend_xmin     | 
> state            | streaming
> sent_location    | 0/6000060
> write_location   | 0/6000060
> flush_location   | 0/6000060
> replay_location  | 0/6000060
> sync_priority    | 0
> sync_state       | async
> -[ RECORD 2 ]----+------------------------------
> pid              | 20715
> usesysid         | 16384
> usename          | replicator
> application_name | walreceiver
> client_addr      | 10.0.15.12
> client_hostname  | 
> client_port      | 42198
> backend_start    | 2018-01-10 18:17:48.712575+00
> backend_xmin     | 
> state            | streaming
> sent_location    | 0/6000060
> write_location   | 0/6000060
> flush_location   | 0/6000060
> replay_location  | 0/6000060
> sync_priority    | 0
> sync_state       | async
> 
> 
> It reports both the standbys working correctly. 
> 
> Halt pg1. 
> 
> On pg2:
> 
> sudo -u postgres pg_ctlcluster 9.6 main promote
> 
> On pg3:
> 
> sudo service postgresql stop
> 
> sudo -u postgres vim /var/lib/postgresql/9.6/main/recovery.conf
> 
> modify it's contents are as follows:
> 
> ```
> primary_conninfo = 'host=pg2 port=5432 user=replicator'
> standby_mode = 'on'
> restore_command = '/usr/bin/pgbackrest --stanza=main archive-get %f "%p"'
> ```
> 
> sudo service postgresql start
> 
> On pg2:
> 
> sudo -u postgres psql -x -c "select * from pg_stat_replication;"
> 
> -[ RECORD 1 ]----+------------------------------
> pid              | 21238
> usesysid         | 16384
> usename          | replicator
> application_name | walreceiver
> client_addr      | 10.0.15.12
> client_hostname  | 
> client_port      | 44852
> backend_start    | 2018-01-10 18:22:20.980701+00
> backend_xmin     | 
> state            | startup
> sent_location    | 
> write_location   | 
> flush_location   | 
> replay_location  | 
> sync_priority    | 0
> sync_state       | async
> 
> As you can see sent_location, write_location, flush_location all are empty. 
> 
> On pg3:
> 
> tail -f /var/log/postgresql/postgresql-9.6-main.log 
> 2018-01-10 18:22:19.490 UTC [21229] LOG:  restored log file "000000010000000000000005" from archive
> 2018-01-10 18:22:19.497 UTC [21229] LOG:  redo starts at 0/5000028
> 2018-01-10 18:22:19.757 UTC [21243] postgres@postgres FATAL:  the database system is starting up
> 2018-01-10 18:22:20.188 UTC [21229] LOG:  restored log file "000000010000000000000006" from archive
> 2018-01-10 18:22:20.195 UTC [21229] LOG:  consistent recovery state reached at 0/7000000
> 2018-01-10 18:22:20.195 UTC [21228] LOG:  database system is ready to accept read only connections
> 2018-01-10 18:22:20.812 UTC [21229] LOG:  invalid magic number 0000 in log segment 000000010000000000000007, offset 0
> 2018-01-10 18:22:20.824 UTC [21255] LOG:  fetching timeline history file for timeline 2 from primary server
> 2018-01-10 18:22:20.828 UTC [21255] LOG:  primary server contains no more WAL on requested timeline 1
> 2018-01-10 18:22:21.417 UTC [21255] LOG:  primary server contains no more WAL on requested timeline 1
> 
> 
> At this point is I insert data in pg2, it doesn't reflect in pg3. 
> 
> Please help. 
> 
> Regards,
> 
> Pritam.
>