Hello,
I have a two-node cluster with qdisk and rgmanager. When I kill aisexec on
node1 (where the service BROKER is running), I get a split-brain
situation: the service BROKER is running on both nodes. After I upgraded
rgmanager to the version from RH 5.5 BETA (rgmanager-2.0.52-3.el5), the
split brain no longer occurs, because the IP address is still up on node1
and node2 detects the address collision (rhbz#526647).
I think that rgmanager on node1 should handle this situation and stop
the BROKER service when aisexec is down.
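The problem arises because I use fence_scsi, and it would be the same with
any SAN-based fencing agent, e.g. fence_brocade. This kind of fencing only
cuts off the node's access to the storage; it does not power the node off,
so whatever was running on the fenced node keeps running.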
---node1---
Mar 5 10:48:52 node1 clurgmgrd: [10813]: <info> Executing
/opt/webmeth/71_prodBroker/Broker/aw_broker71 status
Mar 5 10:49:14 node1 fenced[10361]: cluster is down, exiting
Mar 5 10:49:14 node1 gfs_controld[10373]: cluster is down, exiting
Mar 5 10:49:14 node1 dlm_controld[10367]: cluster is down, exiting
Mar 5 10:49:14 node1 kernel: dlm: closing connection to node 2
Mar 5 10:49:14 node1 kernel: dlm: closing connection to node 1
Mar 5 10:49:19 node1 qdiskd[10340]: <err> cman_dispatch: Host is down
Mar 5 10:49:19 node1 qdiskd[10340]: <err> Halting qdisk operations
Mar 5 10:49:25 node1 kernel: dlm: connect from non cluster node
Mar 5 10:49:42 node1 ccsd[10298]: Unable to connect to cluster
infrastructure after 30 seconds.
Mar 5 10:50:13 node1 ccsd[10298]: Unable to connect to cluster
infrastructure after 60 seconds.
Mar 5 10:50:43 node1 ccsd[10298]: Unable to connect to cluster
infrastructure after 90 seconds.
Mar 5 10:51:13 node1 ccsd[10298]: Unable to connect to cluster
infrastructure after 120 seconds.
Mar 5 10:51:43 node1 ccsd[10298]: Unable to connect to cluster
infrastructure after 150 seconds.
Mar 5 10:52:13 node1 ccsd[10298]: Unable to connect to cluster
infrastructure after 180 seconds.
Mar 5 10:52:43 node1 ccsd[10298]: Unable to connect to cluster
infrastructure after 210 seconds.
---node1---
---node2---
Mar 5 10:50:47 node1 clurgmgrd[20822]: <info> Waiting for node #1 to be
fenced
Mar 5 10:51:11 node1 fenced[8540]: node1 not a cluster member after 30
sec post_fail_delay
Mar 5 10:51:11 node1 fenced[8540]: fencing node "node1"
Mar 5 10:51:11 node1 fenced[8540]: fence "node1" success
Mar 5 10:51:13 node1 clurgmgrd[20822]: <info> Node #1 fenced; continuing
Mar 5 10:51:13 node1 clurgmgrd[20822]: <notice> Taking over service
service:BROKER from down member node1
Mar 5 10:51:13 node1 clurgmgrd: [20822]: <info> mounting
/dev/mapper/storage0-broker on /opt/webmeth/71_prodBroker/Broker/data
Mar 5 10:51:13 node1 kernel: kjournald starting. Commit interval 5 seconds
Mar 5 10:51:13 node1 kernel: EXT3 FS on dm-7, internal journal
Mar 5 10:51:13 node1 kernel: EXT3-fs: mounted filesystem with ordered
data mode.
Mar 5 10:51:13 node1 clurgmgrd: [20822]: <info> Adding IPv4 address
192.168.33.18/24 to bond0
Mar 5 10:51:13 node1 clurgmgrd: [20822]: <err> IPv4 address collision
192.168.33.18
Mar 5 10:51:13 node1 clurgmgrd[20822]: <notice> start on ip
"192.168.33.18/24" returned 1 (generic error)
Mar 5 10:51:13 node1 clurgmgrd[20822]: <warning> #68: Failed to start
service:BROKER; return value: 1
Mar 5 10:51:13 node1 clurgmgrd[20822]: <notice> Stopping service
service:BROKER
Mar 5 10:51:13 node1 clurgmgrd: [20822]: <info> Executing
/opt/webmeth/71_prodBroker/Broker/aw_broker71 stop
Mar 5 10:51:13 node1 clurgmgrd: [20822]: <info> unmounting
/opt/webmeth/71_prodBroker/Broker/data
Mar 5 10:51:13 node1 clurgmgrd[20822]: <notice> Service service:BROKER
is recovering
Mar 5 10:51:13 node1 clurgmgrd[20822]: <warning> #71: Relocating failed
service service:BROKER
Mar 5 10:51:13 node1 clurgmgrd[20822]: <notice> Service service:BROKER
is stopped
---node2---
I have tested this with cman from RH 5.5 (cman-2.0.115-29.el5) and cman
from RH 5.4 BETA (cman-2.0.115-1.el5_4.9); the behaviour is the same with both.
Here is my configuration:
---cut---
<cluster alias="PROD-RH-CLUSTER-BROKER" config_version="5" name="PROD-BROKER">
<quorumd device="/dev/emcpowerb" interval="5" status_file="/root/qdiskstat" tko="8" votes="2">
<heuristic interval="5" program="ping 192.168.33.254 -c1 -t1" score="1" tko="6"/>
<heuristic interval="5" program="/usr/local/bin/smartTouch.sh /opt/webmeth/71_prodBroker/Broker/data" score="1" tko="6"/>
</quorumd>
<fence_daemon post_fail_delay="30" post_join_delay="120"/>
<cman expected_votes="6" two_node="0" broadcast="yes" quorum_dev_poll="35000"/>
<clusternodes>
<clusternode name="node1" nodeid="1" votes="2">
<fence>
<method name="1">
<device name="scsi3-pr" node="node1"/>
</method>
</fence>
</clusternode>
<clusternode name="node2" nodeid="2" votes="2">
<fence>
<method name="1">
<device name="scsi3-pr" node="node2"/>
</method>
</fence>
</clusternode>
</clusternodes>
<fencedevices>
<fencedevice agent="fence_scsi" name="scsi3-pr"/>
</fencedevices>
<rm log_facility="local4" log_level="7">
<failoverdomains>
<failoverdomain name="BROKER" ordered="1" restricted="1">
<failoverdomainnode name="node1" priority="1"/>
<failoverdomainnode name="node2" priority="2"/>
</failoverdomain>
</failoverdomains>
<resources>
<ip address="192.168.33.18/24" monitor_link="1"/>
<script file="/opt/webmeth/71_prodBroker/Broker/aw_broker71" name="broker"/>
<fs device="/dev/mapper/storage0-broker" force_fsck="1" force_unmount="1" fsid="29845" fstype="ext3" mountpoint="/opt/webmeth/71_prodBroker/Broker/data" name="BROKER-FS" options="" self_fence="1"/>
</resources>
<service autostart="1" domain="BROKER" name="BROKER">
<fs ref="BROKER-FS"/>
<ip ref="192.168.33.18/24"/>
<script ref="broker"/>
</service>
</rm>
<totem consensus="4500" token="85000" token_retransmits_before_loss_const="20"/>
</cluster>
---cut---
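For reference, here is a rough sketch of the vote arithmetic implied by the configuration above. It assumes the usual cman rule that the cluster is quorate when the available votes reach expected_votes / 2 + 1 (integer division). The function and variable names are only illustrative and are not part of any cluster tool. It shows why node2 together with the quorum disk stays quorate and goes on to fence node1 after aisexec dies there.

```python
# Illustrative sketch only: vote values are copied from the cluster.conf above,
# the helper names are hypothetical.

NODE_VOTES = {"node1": 2, "node2": 2}   # <clusternode ... votes="2">
QDISK_VOTES = 2                         # <quorumd ... votes="2">
EXPECTED_VOTES = 6                      # <cman expected_votes="6">


def quorum_threshold(expected_votes):
    """Votes needed for quorum, assuming the rule expected_votes // 2 + 1."""
    return expected_votes // 2 + 1


def available_votes(alive_nodes, qdisk_available=True):
    """Sum the votes of the surviving nodes, plus the quorum disk if reachable."""
    votes = sum(NODE_VOTES[name] for name in alive_nodes)
    if qdisk_available:
        votes += QDISK_VOTES
    return votes


def is_quorate(alive_nodes, qdisk_available=True):
    """True if the surviving members hold enough votes for quorum."""
    return available_votes(alive_nodes, qdisk_available) >= quorum_threshold(EXPECTED_VOTES)


if __name__ == "__main__":
    # aisexec killed on node1: only node2 and the quorum disk remain.
    print("threshold:", quorum_threshold(EXPECTED_VOTES))
    print("node2 + qdisk quorate:", is_quorate(["node2"]))
    print("node2 alone quorate:", is_quorate(["node2"], qdisk_available=False))
```

Under these assumptions node2 plus the quorum disk holds 4 of the 6 expected votes, which meets the threshold of 4, so node2 is entitled to fence node1 and take over the service even though BROKER is still running on node1.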
Best Regards
Maciej Bogucki