Re: Broken replication in 3 node setup, master halt, standby 1 promoted, standby 2 - replication not working

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Pritam Barhate,

Please check after promoting pg2 as master pg3 is up and running.
Provide Ip address of pg2 in recovery.conf of pg3 instead  hostname and pass recovery_target_timeline='latest'  ​
Also check archive_command in postgresql.conf of pg2 and restore_command in recovery.conf of pg3 is identical. and permission for pgbackrest.
​​

Hope this helps..

Best Regards,
Shreeyansh

On Thu, Jan 11, 2018 at 12:36 AM, Pritam Barhate <pritambarhate@xxxxxxxxx> wrote:
Hi,

I am trying to test the failover scenario on a brand new 3 node setup. 

pg1 is master
pg2 and pg3 are standby hosts in streaming replication, both following pg1. 

When I halt pg1, promote pg2 and update recovery.conf on pg3 to point to pg2, the replication is broken. Can you please help me identify the correct procedure to achieve this failover? 

Here are the details:

Software Versions:

PostgreSQL: 9.6
Ubuntu 16.04
PgBackRest 1.25

Here is the scenario:

Set up 3 nodes in streaming replication:

master (pg1), standby 1 (pg2), standby 2 (pg3)

Also, I have set up pgbackrest, to back up from standby. It's working correctly.

Run the following on pg1:

sudo -u postgres psql -x -c "select * from pg_stat_replication;"

-[ RECORD 1 ]----+------------------------------
pid              | 20714
usesysid         | 16384
usename          | replicator
application_name | walreceiver
client_addr      | 10.0.15.11
client_hostname  | 
client_port      | 48640
backend_start    | 2018-01-10 18:17:46.105694+00
backend_xmin     | 
state            | streaming
sent_location    | 0/6000060
write_location   | 0/6000060
flush_location   | 0/6000060
replay_location  | 0/6000060
sync_priority    | 0
sync_state       | async
-[ RECORD 2 ]----+------------------------------
pid              | 20715
usesysid         | 16384
usename          | replicator
application_name | walreceiver
client_addr      | 10.0.15.12
client_hostname  | 
client_port      | 42198
backend_start    | 2018-01-10 18:17:48.712575+00
backend_xmin     | 
state            | streaming
sent_location    | 0/6000060
write_location   | 0/6000060
flush_location   | 0/6000060
replay_location  | 0/6000060
sync_priority    | 0
sync_state       | async


It reports both the standbys working correctly. 

Halt pg1. 

On pg2:

sudo -u postgres pg_ctlcluster 9.6 main promote

On pg3:

sudo service postgresql stop

sudo -u postgres vim /var/lib/postgresql/9.6/main/recovery.conf

modify it's contents are as follows:

```
primary_conninfo = 'host=pg2 port=5432 user=replicator'
standby_mode = 'on'
restore_command = '/usr/bin/pgbackrest --stanza=main archive-get %f "%p"'
```

sudo service postgresql start

On pg2:

sudo -u postgres psql -x -c "select * from pg_stat_replication;"

-[ RECORD 1 ]----+------------------------------
pid              | 21238
usesysid         | 16384
usename          | replicator
application_name | walreceiver
client_addr      | 10.0.15.12
client_hostname  | 
client_port      | 44852
backend_start    | 2018-01-10 18:22:20.980701+00
backend_xmin     | 
state            | startup
sent_location    | 
write_location   | 
flush_location   | 
replay_location  | 
sync_priority    | 0
sync_state       | async

As you can see sent_location, write_location, flush_location all are empty. 

On pg3:

tail -f /var/log/postgresql/postgresql-9.6-main.log 
2018-01-10 18:22:19.490 UTC [21229] LOG:  restored log file "000000010000000000000005" from archive
2018-01-10 18:22:19.497 UTC [21229] LOG:  redo starts at 0/5000028
2018-01-10 18:22:19.757 UTC [21243] postgres@postgres FATAL:  the database system is starting up
2018-01-10 18:22:20.188 UTC [21229] LOG:  restored log file "000000010000000000000006" from archive
2018-01-10 18:22:20.195 UTC [21229] LOG:  consistent recovery state reached at 0/7000000
2018-01-10 18:22:20.195 UTC [21228] LOG:  database system is ready to accept read only connections
2018-01-10 18:22:20.812 UTC [21229] LOG:  invalid magic number 0000 in log segment 000000010000000000000007, offset 0
2018-01-10 18:22:20.824 UTC [21255] LOG:  fetching timeline history file for timeline 2 from primary server
2018-01-10 18:22:20.828 UTC [21255] LOG:  primary server contains no more WAL on requested timeline 1
2018-01-10 18:22:21.417 UTC [21255] LOG:  primary server contains no more WAL on requested timeline 1


At this point is I insert data in pg2, it doesn't reflect in pg3. 

Please help. 

Regards,

Pritam.




--
http://www.shreeyansh.com

[Index of Archives]     [KVM ARM]     [KVM ia64]     [KVM ppc]     [Virtualization Tools]     [Spice Development]     [Libvirt]     [Libvirt Users]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite Questions]     [Linux Kernel]     [Linux SCSI]     [XFree86]

  Powered by Linux