We have several Red Hat clusters running cman-2.0.115-34.el5, and yesterday we hit an incident I can't figure out. It is a 2-node cluster plus qdisk, connected through iSCSI with dual paths. We had a network issue and one of the paths to the quorum disk became unavailable. After that I see qdisk eviction messages and the node reboots shortly after, followed by fencing from the 2nd node some seconds later. That shouldn't be the correct behaviour, right? I have reboot="1" in the qdisk settings, but if I understood it right that is only supposed to act on a heuristics downgrade (the score dropping below min_score), not on losing the quorum device itself.

I tried to simulate this but can't reproduce the behaviour: on the test systems nothing happens after the qdisk becomes unavailable, which is what I'd expect. I'm posting some info below; any hints on how to investigate would help. Disabling reboot should prevent this in any case, yes?

cluster.conf:

<totem token="15000"/>
<quorumd max_error_cycles="10" tko_up="2" master_wait="2" allow_kill="0"
         interval="2" label="OracleOne_Quorum" min_score="1" reboot="1"
         tko="20" votes="1" log_level="7" status_file="/tmp/Quorumstatus">
    <heuristic interval="5"
               program="/bin/ping `cat /etc/sysconfig/network | awk -F '=' '/GATEWAY/ {print $2}'` -c1 -t2"
               score="1" tko="50"/>
........
<fence_daemon clean_start="0" post_fail_delay="60" post_join_delay="3"/>
<cman expected_votes="3"/>

cman_tool status:

Version: 6.2.0
Config Version: 23
Cluster Name: OracleOne
Cluster Id: 47365
Cluster Member: Yes
Cluster Generation: 444
Membership state: Cluster-Member
Nodes: 2
Expected votes: 3
Quorum device votes: 1
Total votes: 3
Quorum: 2
Active subsystems: 9
Flags: Dirty
Ports Bound: 0 177
Node ID: 2
.....

cat /tmp/Quorumstatus

Time Stamp: Tue Jan 18 11:54:17 2011
Node ID: 2
Score: 1/1 (Minimum required = 1)
Current state: Running
Initializing Set: { }
Visible Set: { 1 2 }
Master Node ID: 1

/var/log/messages:

Jan 17 17:53:35 <kern.err> NODE2 kernel: connection1:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4554104323, last ping 4554109323, now 4554114323
Jan 17 17:53:35 <kern.info> NODE2 kernel: connection1:0: detected conn error (1011)
Jan 17 17:53:36 <daemon.warn> NODE2 iscsid: Kernel reported iSCSI connection 1:0 error (1011) state (3)
Jan 17 17:53:36 <daemon.warn> NODE2 iscsid: Kernel reported iSCSI connection 1:0 error (1011) state (3)
Jan 17 17:53:37 <kern.err> NODE2 kernel: connection2:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4554106362, last ping 4554111362, now 4554116362
Jan 17 17:53:37 <kern.info> NODE2 kernel: connection2:0: detected conn error (1011)
Jan 17 17:53:38 <daemon.warn> NODE2 iscsid: Kernel reported iSCSI connection 2:0 error (1011) state (3)
Jan 17 17:53:38 <daemon.warn> NODE2 iscsid: Kernel reported iSCSI connection 2:0 error (1011) state (3)
Jan 17 17:53:42 <kern.info> NODE2 kernel: session2: session recovery timed out after 5 secs
Jan 17 17:53:42 <kern.info> NODE2 kernel: sd 12:0:0:1: SCSI error: return code = 0x000f0000
Jan 17 17:53:42 <kern.warn> NODE2 kernel: end_request: I/O error, dev sdau, sector 8
Jan 17 17:53:42 <kern.warn> NODE2 kernel: device-mapper: multipath: Failing path 66:224.
Jan 17 17:53:42 <daemon.notice> NODE2 multipathd: dm-116: remove map (uevent)
Jan 17 17:53:42 <daemon.warn> NODE2 multipathd: sdau: tur checker reports path is down
Jan 17 17:53:42 <daemon.warn> NODE2 multipathd: sdau: tur checker reports path is down
Jan 17 17:53:42 <daemon.notice> NODE2 multipathd: checker failed path 66:224 in map mpath_iSCSI_qdisk
Jan 17 17:53:42 <daemon.notice> NODE2 multipathd: checker failed path 66:224 in map mpath_iSCSI_qdisk
Jan 17 17:53:42 <daemon.notice> NODE2 multipathd: mpath_iSCSI_qdisk: remaining active paths: 1
Jan 17 17:53:42 <daemon.notice> NODE2 multipathd: mpath_iSCSI_qdisk: remaining active paths: 1
Jan 17 17:53:42 <daemon.notice> NODE2 multipathd: dm-75: add map (uevent)
Jan 17 17:53:42 <daemon.notice> NODE2 multipathd: dm-75: add map (uevent)
Jan 17 17:53:44 <local4.info> NODE2 openais[8952]: [CMAN ] lost contact with quorum device
Jan 17 17:53:49 <daemon.warn> NODE2 qdiskd[8981]: <warning> qdiskd: read (system call) has hung for 20 seconds
Jan 17 17:53:49 <daemon.warn> NODE2 qdiskd[8981]: <warning> qdiskd: read (system call) has hung for 20 seconds
Jan 17 17:53:49 <daemon.warn> NODE2 qdiskd[8981]: <warning> In 20 more seconds, we will be evicted
Jan 17 17:53:49 <daemon.warn> NODE2 qdiskd[8981]: <warning> In 20 more seconds, we will be evicted
Jan 17 17:55:10 <kern.info> NODE2 kernel: md: stopping all md devices. ( Shutdown )
Jan 17 17:55:12 <kern.info> NODE2 kernel: bonding: bond2: link status down for interface eth0, disabling it in 2000 ms.

2nd node /var/log/messages:

Jan 17 17:55:57 <kern.err> NODE1 kernel: dlm: closing connection to node 2
Jan 17 17:56:57 <daemon.info> NODE1 fenced[8982]: NODE2-cl not a cluster member after 60 sec post_fail_delay
Jan 17 17:56:57 <daemon.info> NODE1 fenced[8982]: fencing node "NODE2-cl"
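
PS: in case it helps, this is roughly what I ran on the test systems to try to reproduce the path loss (the portal addresses below are placeholders for our real iSCSI targets; everything else is stock iptables):

# block the iSCSI portal of the first path and let multipath fail it ...
iptables -A OUTPUT -p tcp -d 192.168.10.1 --dport 3260 -j DROP
sleep 60
# ... then block the second path so the qdisk disappears completely
iptables -A OUTPUT -p tcp -d 192.168.11.1 --dport 3260 -j DROP

# watch qdiskd while the device is gone
tail -f /var/log/messages | grep -E 'qdiskd|multipath|iscsi'
cat /tmp/Quorumstatus

# restore both paths afterwards
iptables -D OUTPUT -p tcp -d 192.168.10.1 --dport 3260 -j DROP
iptables -D OUTPUT -p tcp -d 192.168.11.1 --dport 3260 -j DROP

With that in place the test node just keeps running, as described above, which is why the production reboot surprises me.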
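
And to make my last question concrete: if I read qdisk(5) right, flipping only the reboot attribute, i.e. something like the line below (all other attributes and the heuristic unchanged), should stop qdiskd from rebooting the node on a score transition. Is that enough to prevent what we saw, or does the eviction path reboot the node regardless of this flag?

<quorumd max_error_cycles="10" tko_up="2" master_wait="2" allow_kill="0"
         interval="2" label="OracleOne_Quorum" min_score="1" reboot="0"
         tko="20" votes="1" log_level="7" status_file="/tmp/Quorumstatus">
........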