Hello.
We have been looking into the Corosync/Pacemaker stack for building a high-availability
cluster of PostgreSQL servers with automatic failover. We are using Corosync (2.3.4)
as the messaging layer and the stateful master/slave resource agent (pgsql) with
Pacemaker (1.1.12) on CentOS 7.1.

Things work pretty well for a static cluster, where membership is defined up front.
However, we need to be able to seamlessly add new machines (nodes) to the cluster
and remove existing ones from it without service interruption, and there we ran
into a problem.
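Concretely, what we would like to be able to run at any time, while the cluster is
serving traffic, is something along these lines (the remove direction is shown only
as the pcs command we expect to use, we have not gotten that far yet):

# pcs cluster node add <new-node> --start
# pcs cluster node remove <old-node>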
Is it possible to add a new node dynamically, without interruption? Here are the steps we are using to add a node:

# pcs property show
Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: mycluster1
 dc-version: 1.1.13-a14efad
 have-watchdog: false
 last-lrm-refresh: 1444042099
 no-quorum-policy: stop
 stonith-action: reboot
 stonith-enabled: true
Node Attributes:
 pi01: pgsql-data-status=STREAMING|SYNC
 pi02: pgsql-data-status=STREAMING|POTENTIAL
 pi03: pgsql-data-status=LATEST

# pcs resource show --full
 Group: master-group
  Resource: vip-master (class=ocf provider=heartbeat type=IPaddr2)
   Attributes: ip=192.168.242.100 nic=eth0 cidr_netmask=24
   Operations: start interval=0s timeout=60s on-fail=restart (vip-master-start-interval-0s)
               monitor interval=10s timeout=60s on-fail=restart (vip-master-monitor-interval-10s)
               stop interval=0s timeout=60s on-fail=block (vip-master-stop-interval-0s)
  Resource: vip-rep (class=ocf provider=heartbeat type=IPaddr2)
   Attributes: ip=192.168.242.101 nic=eth0 cidr_netmask=24
   Meta Attrs: migration-threshold=0
   Operations: start interval=0s timeout=60s on-fail=stop (vip-rep-start-interval-0s)
               monitor interval=10s timeout=60s on-fail=restart (vip-rep-monitor-interval-10s)
               stop interval=0s timeout=60s on-fail=ignore (vip-rep-stop-interval-0s)
 Master: msPostgresql
  Meta Attrs: master-max=1 master-node-max=1 clone-max=3 clone-node-max=1 notify=true
  Resource: pgsql (class=ocf provider=heartbeat type=pgsql)
   Attributes: pgctl=/usr/pgsql-9.5/bin/pg_ctl psql=/usr/pgsql-9.5/bin/psql pgdata=/var/lib/pgsql/9.5/data/ rep_mode=sync node_list="pi01 pi02 pi03" restore_command="cp /var/lib/pgsql/9.5/data/wal_archive/%f %p" primary_conninfo_opt="user=repl password=super-pass-for-repl keepalives_idle=60 keepalives_interval=5 keepalives_count=5" master_ip=192.168.242.100 restart_on_promote=true check_wal_receiver=true
   Operations: start interval=0s timeout=60s on-fail=restart (pgsql-start-interval-0s)
               monitor interval=4s timeout=60s on-fail=restart (pgsql-monitor-interval-4s)
               monitor role=Master interval=3s timeout=60s on-fail=restart (pgsql-monitor-interval-3s-role-Master)
               promote interval=0s timeout=60s on-fail=restart (pgsql-promote-interval-0s)
               demote interval=0s timeout=60s on-fail=stop (pgsql-demote-interval-0s)
               stop interval=0s timeout=60s on-fail=block (pgsql-stop-interval-0s)
               notify interval=0s timeout=60s (pgsql-notify-interval-0s)
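Note that node_list and clone-max in the configuration above only cover pi01, pi02
and pi03, so the new node also has to be reflected there at some point. In pcs terms
that would be something like the following (syntax from memory, shown only to make
the intent clear):

# pcs resource update pgsql node_list="pi01 pi02 pi03 pi05"
# pcs resource meta msPostgresql clone-max=4

Now the add itself: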
# pcs cluster auth pi01 pi02 pi03 pi05 -u hacluster -p hacluster
pi01: Authorized
pi02: Authorized
pi03: Authorized
pi05: Authorized
# pcs cluster node add pi05 --start
pi01: Corosync updated
pi02: Corosync updated
pi03: Corosync updated
pi05: Succeeded
pi05: Starting Cluster...
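Corosync and pcs report success. Assuming the usual nodelist-based corosync.conf
that pcs maintains, every node's /etc/corosync/corosync.conf should now contain an
entry for pi05 roughly like this (the nodeid value is only illustrative, it is
whatever pcs assigned):

nodelist {
    node {
        ring0_addr: pi05
        nodeid: 4
    }
}

And indeed the new node shows up as online: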
# crm_mon -Afr1
Last updated: Fri Oct  2 16:59:54 2015          Last change: Fri Oct  2 16:59:23 2015 by hacluster via crmd on pi02
Stack: corosync
Current DC: pi02 (version 1.1.13-a14efad) - partition with quorum
4 nodes and 8 resources configured

Online: [ pi01 pi02 pi03 pi05 ]
Full list of resources:
 Resource Group: master-group
     vip-master (ocf::heartbeat:IPaddr2):       Started pi02
     vip-rep    (ocf::heartbeat:IPaddr2):       Started pi02
 Master/Slave Set: msPostgresql [pgsql]
     Masters: [ pi02 ]
     Slaves: [ pi01 pi03 ]
 fence-pi01     (stonith:fence_ssh):    Started pi02
 fence-pi02     (stonith:fence_ssh):    Started pi01
 fence-pi03     (stonith:fence_ssh):    Started pi01
Node Attributes:
* Node pi01:
    + master-pgsql                    : 100
    + pgsql-data-status               : STREAMING|SYNC
    + pgsql-receiver-status           : normal
    + pgsql-status                    : HS:sync
* Node pi02:
    + master-pgsql                    : 1000
    + pgsql-data-status               : LATEST
    + pgsql-master-baseline           : 0000000008000098
    + pgsql-receiver-status           : ERROR
    + pgsql-status                    : PRI
* Node pi03:
    + master-pgsql                    : -INFINITY
    + pgsql-data-status               : STREAMING|POTENTIAL
    + pgsql-receiver-status           : normal
    + pgsql-status                    : HS:potential
* Node pi05:
Migration Summary:
* Node pi01:
* Node pi03:
* Node pi02:
* Node pi05:
In the next snapshot, taken a few minutes later, the msPostgresql clone set has been
extended to pi05 (note the resource count going from 8 to 9), and pgsql fails to
start there:

# crm_mon -Afr1
Last updated: Fri Oct  2 17:04:36 2015          Last change: Fri Oct  2 17:04:07 2015 by root via cibadmin on pi01
Stack: corosync
Current DC: pi02 (version 1.1.13-a14efad) - partition with quorum
4 nodes and 9 resources configured

Online: [ pi01 pi02 pi03 pi05 ]
Full list of resources:
 Resource Group: master-group
     vip-master (ocf::heartbeat:IPaddr2):       Started pi02
     vip-rep    (ocf::heartbeat:IPaddr2):       Started pi02
 Master/Slave Set: msPostgresql [pgsql]
     Masters: [ pi02 ]
     Slaves: [ pi01 pi03 ]
     Stopped: [ pi05 ]
 fence-pi01     (stonith:fence_ssh):    Started pi02
 fence-pi02     (stonith:fence_ssh):    Started pi01
 fence-pi03     (stonith:fence_ssh):    Started pi01
Node Attributes:
* Node pi01:
    + master-pgsql                    : 100
    + pgsql-data-status               : STREAMING|SYNC
    + pgsql-receiver-status           : normal
    + pgsql-status                    : HS:sync
* Node pi02:
    + master-pgsql                    : 1000
    + pgsql-data-status               : LATEST
    + pgsql-master-baseline           : 0000000008000098
    + pgsql-receiver-status           : ERROR
    + pgsql-status                    : PRI
* Node pi03:
    + master-pgsql                    : -INFINITY
    + pgsql-data-status               : STREAMING|POTENTIAL
    + pgsql-receiver-status           : normal
    + pgsql-status                    : HS:potential
* Node pi05:
    + master-pgsql                    : -INFINITY
    + pgsql-status                    : STOP
Migration Summary:
* Node pi01:
* Node pi03:
* Node pi02:
* Node pi05:
   pgsql: migration-threshold=1 fail-count=1000000 last-failure='Fri Oct  2 17:04:13 2015'
Failed Actions:
* pgsql_start_0 on pi05 'unknown error' (1): call=27, status=complete, exitreason='My data may be inconsistent. You have to remove /var/lib/pgsql/tmp/PGSQL.lock file to force start.',
    last-rc-change='Fri Oct  2 17:04:10 2015', queued=0ms, exec=2553ms
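The exitreason itself tells us what the agent wants on pi05: remove the stale lock
file and clear the failure so Pacemaker retries the start. Roughly (a sketch only,
we have not verified that this is the right recovery here):

on pi05:
# rm /var/lib/pgsql/tmp/PGSQL.lock
# pcs resource cleanup pgsql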
And here we fall into real trouble: a few minutes later pgsql-status is STOP on every node!

# crm_mon -Afr1
Last updated: Fri Oct  2 17:07:05 2015          Last change: Fri Oct  2 17:06:37 2015 by root via cibadmin on pi01
Stack: corosync
Current DC: pi02 (version 1.1.13-a14efad) - partition with quorum
4 nodes and 9 resources configured

Online: [ pi01 pi02 pi03 pi05 ]
Full list of resources:
 Resource Group: master-group
     vip-master (ocf::heartbeat:IPaddr2):       Stopped
     vip-rep    (ocf::heartbeat:IPaddr2):       Stopped
 Master/Slave Set: msPostgresql [pgsql]
     Slaves: [ pi02 ]
     Stopped: [ pi01 pi03 pi05 ]
 fence-pi01     (stonith:fence_ssh):    Started pi02
 fence-pi02     (stonith:fence_ssh):    Started pi01
 fence-pi03     (stonith:fence_ssh):    Started pi01
Node Attributes:
* Node pi01:
    + master-pgsql                    : -INFINITY
    + pgsql-data-status               : STREAMING|SYNC
    + pgsql-status                    : STOP
* Node pi02:
    + master-pgsql                    : -INFINITY
    + pgsql-data-status               : LATEST
    + pgsql-status                    : STOP
* Node pi03:
    + master-pgsql                    : -INFINITY
    + pgsql-data-status               : STREAMING|POTENTIAL
    + pgsql-status                    : STOP
* Node pi05:
    + master-pgsql                    : -INFINITY
    + pgsql-status                    : STOP
Migration Summary:
* Node pi01:
* Node pi03:
* Node pi02:
* Node pi05:
Do you know a way to add a new node to the cluster without this disruption? Maybe some command, or something else?

--
Nikolay Popov
n.popov@xxxxxxxxxxxxxx

Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company