1) When one node detects that another has 'missed too many heartbeats', what decision-making process leads to the final outcome of fencing that node?
2) If a few nodes are down for maintenance, having left the cluster with "remove" so that the 'quorum' count was adjusted but the 'expected' count was not, how might this affect the answer to question #1?
It would be even better if the answers could use our 11-node RHEL AS 4.5 cluster as an example:
$ cman_tool nodes
Node  Votes  Exp  Sts  Name
   1      1   19  M    db2
   2      5   19  M    net1
   3      5   19  M    net2
   4      1   19  M    db4
   5      1   19  M    db1
   6      1   19  M    db5
   7      1   19  X    app3
   8      1   19  X    app2
   9      1   19  M    app6
  10      1   19  M    db3
  11      1   19  X    net3
LVS network tier: net1 (5-votes), net2 (5-votes), net3 (remove)
Application tier: app2 (remove), app3 (remove), app6
Database tier: db1, db2, db3, db4, db5
Expected: 19, Quorum: 9, Total votes: 16
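As a sanity check on those three figures, here is a small Python sketch that recomputes them from the node listing. The "more than half" rule in simple_majority() is an assumption made for illustration, not something taken from the CMAN source. It reproduces the reported quorum of 9 when applied to the 16 votes held by current members (Sts = M), whereas applying it to the expected 19 would give 10.

```python
# Node data copied from the `cman_tool nodes` listing: (name, votes, status).
NODES = [
    ("db2", 1, "M"), ("net1", 5, "M"), ("net2", 5, "M"), ("db4", 1, "M"),
    ("db1", 1, "M"), ("db5", 1, "M"), ("app3", 1, "X"), ("app2", 1, "X"),
    ("app6", 1, "M"), ("db3", 1, "M"), ("net3", 1, "X"),
]

def simple_majority(votes):
    """Assumed rule (not confirmed against CMAN): more than half of `votes`."""
    return votes // 2 + 1

# Votes of every configured node, and of current members only.
all_votes = sum(votes for _, votes, _ in NODES)
member_votes = sum(votes for _, votes, status in NODES if status == "M")

# Votes that remain once net1 (5 votes) is lost from the membership.
votes_without_net1 = member_votes - dict((n, v) for n, v, _ in NODES)["net1"]

print("votes of all 11 nodes:     ", all_votes)
print("votes of current members:  ", member_votes)
print("majority of member votes:  ", simple_majority(member_votes))
print("majority of expected votes:", simple_majority(all_votes))
print("member votes without net1: ", votes_without_net1)
```

Under that assumption, the votes left after net1 disappears still exceed the reported quorum of 9, so the remaining members would stay quorate.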
FYI: nodes net3, app2, and app3 left the cluster with "remove" so that we could do some isolated testing of the RHEL AS 4.6 update; of the three, only net3 was left powered on. This state lasted for over a week.
The syslog messages from each member show what happened when net1 went 'dark':
Mar 15 16:20:28 net2 kernel: CMAN: node net1 has been removed from the cluster : Missed too many heartbeats
Mar 15 16:20:29 net2 fenced[19273]: fencing deferred to db2
Mar 15 16:23:05 net2 clurgmgrd[20012]: <info> Magma Event: Membership Change
Mar 15 16:23:05 net2 clurgmgrd[20012]: <info> State change: net1 DOWN
Mar 15 12:29:16 app6 kernel: CMAN: node net1 has been removed from the cluster : Missed too many heartbeats
Mar 15 12:29:17 app6 fenced[19015]: fencing deferred to db2
Mar 15 12:31:53 app6 clurgmgrd[21831]: <info> Magma Event: Membership Change
Mar 15 12:31:53 app6 clurgmgrd[21831]: <info> State change: net1 DOWN
Mar 15 16:29:19 db1 kernel: CMAN: node net1 has been removed from the cluster : Missed too many heartbeats
Mar 15 16:29:20 db1 fenced[19297]: fencing deferred to db2
Mar 15 16:31:56 db1 clurgmgrd[21436]: <info> Magma Event: Membership Change
Mar 15 16:31:56 db1 clurgmgrd[21436]: <info> State change: net1 DOWN
Mar 15 16:29:19 db2 kernel: CMAN: removing node net1 from the cluster : Missed too many heartbeats
Mar 15 16:29:20 db2 fenced[14778]: net1 not a cluster member after 0 sec post_fail_delay
Mar 15 16:29:20 db2 fenced[14778]: fencing node "net1"
Mar 15 16:31:48 db2 ccsd[14677]: Attempt to close an unopened CCS descriptor (151704870).
Mar 15 16:31:48 db2 ccsd[14677]: Error while processing disconnect: Invalid request descriptor
Mar 15 16:31:48 db2 fenced[14778]: fence "net1" success
Mar 15 16:29:19 db3 kernel: CMAN: node net1 has been removed from the cluster : Missed too many heartbeats
Mar 15 16:29:20 db3 fenced[19097]: fencing deferred to db2
Mar 15 16:31:56 db3 clurgmgrd[21315]: <info> Magma Event: Membership Change
Mar 15 16:31:56 db3 clurgmgrd[21315]: <info> State change: net1 DOWN
Mar 15 16:29:19 db4 kernel: CMAN: node net1 has been removed from the cluster : Missed too many heartbeats
Mar 15 16:29:20 db4 fenced[19126]: fencing deferred to db2
Mar 15 16:31:56 db4 clurgmgrd[21182]: <info> Magma Event: Membership Change
Mar 15 16:31:56 db4 clurgmgrd[21182]: <info> State change: net1 DOWN
Mar 15 16:29:19 db5 kernel: CMAN: node net1 has been removed from the cluster : Missed too many heartbeats
Mar 15 16:29:20 db5 fenced[14508]: fencing deferred to db2
Mar 15 16:31:56 db5 clurgmgrd[17187]: <info> Magma Event: Membership Change
Mar 15 16:31:56 db5 clurgmgrd[17187]: <info> State change: net1 DOWN
It may be of no consequence, but note that there was clock drift on net2 because of a failed NTP server, and on app6 because its clock was not recalibrated after it had been down for a few weeks for a motherboard swap and memory upgrade.
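Because of that drift, timestamps are only safely comparable within a single host. As a rough illustration, the Python sketch below measures how long the fence operation took using db2's own lines only. The regular expression and the helper name seconds_of_day are illustrative choices, and the sketch assumes both events fall on the same day, as they do in the excerpt.

```python
import re

# Three lines copied verbatim from the db2 excerpt in the syslog listing.
DB2_LOG = """\
Mar 15 16:29:19 db2 kernel: CMAN: removing node net1 from the cluster : Missed too many heartbeats
Mar 15 16:29:20 db2 fenced[14778]: fencing node "net1"
Mar 15 16:31:48 db2 fenced[14778]: fence "net1" success
"""

# Month, day, HH:MM:SS, host, then the rest of the message.
LINE = re.compile(r"^\w{3} +\d+ (\d\d):(\d\d):(\d\d) (\S+) (.*)$")

def seconds_of_day(hours, minutes, seconds):
    """Convert a wall-clock time to seconds since midnight."""
    return int(hours) * 3600 + int(minutes) * 60 + int(seconds)

events = []
for line in DB2_LOG.splitlines():
    match = LINE.match(line)
    if match:
        hours, minutes, seconds, host, message = match.groups()
        events.append((seconds_of_day(hours, minutes, seconds), host, message))

# First line announcing the fence attempt, and the line reporting success.
started = next(t for t, _, msg in events if "fencing node" in msg)
finished = next(t for t, _, msg in events if "success" in msg)
fence_duration = finished - started

print("fence of net1 took", fence_duration, "seconds on db2's clock")
```

The same approach would not be reliable for comparing db2's lines against those from net2 or app6, whose clocks were off.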
-- Linux-cluster mailing list Linux-cluster@xxxxxxxxxx https://www.redhat.com/mailman/listinfo/linux-cluster